A Bayesian framework to integrate multi-level genome-scale data for Autism risk gene prioritization

Background Autism spectrum disorder (ASD) is a group of complex neurodevelopment disorders with a strong genetic basis. Large scale sequencing studies have identified over one hundred ASD risk genes. Nevertheless, the vast majority of ASD risk genes remain to be discovered, as it is estimated that more than 1000 genes are likely to be involved in ASD risk. Prioritization of risk genes is an effective strategy to increase the power of identifying novel risk genes in genetics studies of ASD. As ASD risk genes are likely to exhibit distinct properties from multiple angles, we reason that integrating multiple levels of genomic data is a powerful approach to pinpoint genuine ASD risk genes. Results We present BNScore, a Bayesian model selection framework to probabilistically prioritize ASD risk genes through explicitly integrating evidence from sequencing-identified ASD genes, biological annotations, and gene functional network. We demonstrate the validity of our approach and its improved performance over existing methods by examining the resulting top candidate ASD risk genes against sets of high-confidence benchmark genes and large-scale ASD genome-wide association studies. We assess the tissue-, cell type- and development stage-specific expression properties of top prioritized genes, and find strong expression specificity in brain tissues, striatal medium spiny neurons, and fetal developmental stages. Conclusions In summary, we show that by integrating sequencing findings, functional annotation profiles, and gene-gene functional network, our proposed BNScore provides competitive performance compared to current state-of-the-art methods in prioritizing ASD genes. Our method offers a general and flexible strategy to risk gene prioritization that can potentially be applied to other complex traits as well. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04616-y.


SUPPLEMENTARY NOTES A. Prior odds that a gene being an ASD risk gene: network construction
We follow previous work [1,2] to construct a gene-gene network to represent each gene's prior odds of being an ASD gene. The following are the three major steps in this process.

A.1. Construct an adjacency matrix from Gene Ontology (GO) annotations
We created a weighted network from publicly available GO annotations. Each GO annotation term includes a gene product and a molecular function, biological process, or cellular component. First, we assign a weight for each GO annotation term T (e.g., a molecular function) based on how specific the term is: measured by the number of genes associated with the term (N T ) divided by total number of genes (N total ) with annotations.
S T = N T N total (S1) Then, for distance between gene i (g i ) and gene j (g j ), we assign a score proportional to the log of the ratio of the likelihood that the two genes participate in the same GO annotation to the likelihood that they do not by summing all the terms shared by these two genes [1].
Then we column normalized matrix W to make them unit-length, and used the resulting matrix W as adjacency matrix.

A.2. Construct a transition matrix from adjacency matrix
From the symmetric adjacency matrix W, we derive a transition matrix P. The rows of the transition matrix are probability vectors, in other words, the rows of matrix P are numerical vectors whose entries are real numbers between 0 and 1 whose sum is 1. We use P g i ,g j to reflect the probability of stepping to node j from node i informed by the adjacency matrix W.

A.3. Apply random walk with restart algorithm to compute distance
Here we use Random walk with restart (RWR) to measure how closely related are two genes (i.e. nodes in a weighted graph) in the transition matrix P ( [2,3]). For each step starting from any node i of the network, the walker has two options: either move to a neighbor with a probability 1 − r or stay at node i with a probability r. The parameter r is called the restart probability. In the current study, we fix r = 0.3. The probability of the walker to walk to each neighbor is proportional to the weight of the edge connecting them. Let q t denote a vector with the reaching probability to all nodes at step t starting from node i.
S i is an indicator vector with the ith element as 1 and 0 for others, which means the starting node is node i. q t can be updated step by step until |q t+1 − q t | < T, where T is a predefined threshold. We set T to 1e −6 in this study.
For each gene i, let q t be the vector with the reaching probabilities to all the genes (after stabilization), let P(N p ) be the average reaching probabilities to all positive training genes, which is proportional to the prior odds of gene i being a risk gene.

B.1. Binary annotations
We assume that a binary annotation D l follows the Bernoulli distribution Bernoulli(θ lj ), where θ lj represents the fraction of disease-associated genes with this annotation under model M j (j = 0, 1).
To estimate the parameter θ ij , suppose we have a prior distribution of the parameter (determined by hyperparameters of beta distribution), and we observe the training set genes along with their annotations information and labels (disease-associated/nondisease-associated), we use empirical Bayes approach to estimate the distribution of these parameters and hyperparameters.

• Prior distribution of θ ij
To obtain conjugate prior for θ, we use Beta distribution here.
From the training data, we can update the distribution of θ ij Suppose we have n seed genes, k of which possess this binary annotation, i.e., ∑ n g=1 D lg = k. For a single observation, the marginal likelihood is The moments of Beta-Bernoulli distributions are: Let the first moment of D be A 1 = 1 n ∑ n g=1 D k l and we equate E(D l ) to A 1 . As we have one equation and two parameters to estimate, we assumeβ i1 = 1 so that we can get estimateα i1 by solving E(D l ) = A 1 , the resulting estimate for hyperparameters:β With these in hand, the estimated mean of θ l1 : Similarly, under M 0 , we can derive the posterior estimate of α i0 and β i0 from background genes.

B.2. Continuous annotations
We assume that a continuous annotation D l follows a normal distribution N(µ lj , θ lj ), where µ lj and θ lj are the mean and variance, respectively, under model M j (j = 0, 1).
For D ∼ t υ (µ, σ 2 ), it was shown that the first six moments of D are [5]: if υ > 2: if υ > 6: Here we can obtain estimators for µ if we setκ = 1 We use the posterior distribution of D and method of moments to estimate prior parameters under M 1 using seed genes and parameters under M 0 using background genes. Let A k = 1 n ∑ n g=1 D k g (kth moment of D), we have: